After this lab you will be able to:

- manipulate data using base R functions and the dplyr package
- use the five main git operations: init, add, commit, push, pull
- start a project with RStudio and keep it under version control with git
- submit your homework with git and GitHub
This lab is adapted from materials by:

- Matthew Salganik
- Andrew Bray
- Jeffrey Arnold (https://github.com/POLS503/pols_503_sp15)
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.
source("http://www.openintro.org/stat/data/cdc.R")
The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.
To view the names of the variables, type the command
names(cdc)
This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.
We can learn most of this information using the str function. The output reveals that there are 20000 cases, 9 variables, and then shows the type of data that each variable is represented by. For instance, the height variable is a numeric variable.
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
We can have a look at the first few entries (rows) of our data with the command
head(cdc)
and similarly we can look at the last few by typing
tail(cdc)
You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.
The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum. For weight this is
summary(cdc$weight)
R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type
mean(cdc$weight)
var(cdc$weight)
median(cdc$weight)
We can use the commands above one at a time, but the summary function makes this very easy, reporting the min, max, mean, and quartiles for a given variable all at once. In this case, the minimum age in the data set is 18, the maximum is 99, and the mean is about 45.07.
min(cdc$age)
## [1] 18
max(cdc$age)
## [1] 99
mean(cdc$age)
## [1] 45.06825
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
table(cdc$smoke100)
or instead look at the relative frequency distribution by typing
table(cdc$smoke100)/nrow(cdc)
Notice how R automatically divides all entries in the table by the number of observations (using the nrow function to produce the number of rows, 20,000, in the cdc data) in the command above. This is similar to something we observed in Lab 1: when we multiplied or divided a vector by a number, R applied that action across the entries of the vector. As we see above, this also works for tables.
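As a quick aside, here is a minimal toy sketch (made-up responses, not the cdc data) showing that dividing a table by a number divides every entry:

```r
# Toy example: dividing a table by a scalar divides each cell,
# just like vector arithmetic in Lab 1
responses <- c("yes", "no", "yes", "yes")
tab <- table(responses)
tab                       # counts: no 1, yes 3
tab / length(responses)   # relative frequencies: no 0.25, yes 0.75
```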
Also, in Lab 1, we explored the use of logical values (TRUE/FALSE). You can use this functionality to explore data as well. For instance, the following code produces a TRUE response for every respondent who is 6 ft (72 inches) or more in height, and then summarizes the number of respondents in each category:
summary(cdc$height>=72)
We can simply change the variable name and the number in the line of code above:
summary(cdc$weight>=175)
## Mode FALSE TRUE NA's
## logical 11651 8349 0
Another clean way to do this is with the table command:
table(cdc$weight>=175)
##
## FALSE TRUE
## 11651 8349
R will also treat TRUE/FALSE variables as a 1/0 binary variable if you want it to, so you could use:
mean(cdc$weight>=175)
## [1] 0.41745
to get the proportion of observations with weight greater than 175.
Exercise: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

The summary command is again the workhorse here, as it will show the quartiles (and hence the interquartile range) for each variable:
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
You could also use the IQR() function to produce the same information. This format might be easier if you were taking the value and doing something else with it like inserting it into a table:
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.
table(cdc$gender,cdc$smoke100)
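Relatedly, base R's prop.table() converts such a table of counts into proportions, either over the whole table or within each row. A small sketch on made-up data (not the cdc survey):

```r
# Toy data standing in for the cdc variables
gender <- c("m", "m", "f", "f", "f")
smoke  <- c(1, 0, 0, 1, 0)
tab <- table(gender, smoke)
prop.table(tab)              # each cell as a share of all observations
prop.table(tab, margin = 1)  # shares within each row (i.e., within gender)
```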
We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type
dim(cdc)
which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format
cdc[567,6]
which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list of variable names:
names(cdc)
To see the weights for the first 10 respondents we can type
cdc[1:10,6]
In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering
1:10
Finally, if we want all of the data for the first 10 respondents, type
cdc[1:10,]
By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567th, or rows 1 through 10. Try the following to see the weights for all 20,000 respondents fly by on your screen
cdc[,6]
Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the data set. An alternative method to access the weight data is by referring to the name. Previously, we typed names(cdc) to see all the variables contained in the cdc data set. We can use any of the variable names to select items in our data set.
cdc$weight
The dollar-sign tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing
cdc$weight[567]
Similarly, for just the first 10 respondents
cdc$weight[1:10]
The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose depends on your personal preference.
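To convince yourself the two notations agree, you can check them with identical() on a small made-up data frame (the cdc data itself is not needed for the point):

```r
# Toy data frame standing in for cdc: column 2 is weight
df <- data.frame(id = 1:5, weight = c(175, 125, 105, 132, 150))
identical(df[1:3, 2], df$weight[1:3])  # TRUE: both notations give the same vector
```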
It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands.
First, consider expressions like
cdc$gender == "m"
or
cdc$age > 30
These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 30 (second command).
- NA: not available, missing
- NULL: does not exist, is undefined
- TRUE, T: logical true. Logical is also an object class.
- FALSE, F: logical false
| Function | Meaning |
|---|---|
| is.na | Is the value NA? |
| is.null | Is the value NULL? |
| isTRUE | Is the value TRUE? |
| !isTRUE | Is the value FALSE? |
absent <- NA
is.na(absent)
## [1] TRUE
Missing data is particularly important to handle carefully. Consider a vector containing an NA:
foo <- c(1, 2, NA, 3, 4)
Missing Data Challenge:

- What does 2 + NA return?
- What does mean(foo) return? Use the na.rm argument of mean to change how that function handles missing values.
- Does median(foo) work?
- Try foo > 2. Are all the entries TRUE and FALSE?
- What does is.na(foo) do? What about ! is.na(foo)?
- What does foo[! is.na(foo)] do?
2 + NA               #any arithmetic operation (+ - * /) involving NA returns NA
mean(foo)            #returns NA (because it involves arithmetic with an NA value)
mean(foo, na.rm=T)   #works because it drops the NA value before computing the mean
median(foo)          #again, returns NA because of the NA value
median(foo, na.rm=T) #good to go
foo > 2              #returns TRUE/FALSE for each non-NA value, NA for the NA value
is.na(foo)           #returns TRUE where a value is NA, FALSE otherwise
!is.na(foo)          #the inverse: TRUE where a value is not NA
foo[!is.na(foo)]     #returns the values of foo that are not NA
The function na.omit is particularly useful. It removes any row in a dataset with a missing value in any column. For example,
dfrm <- data.frame(x = c(NA, 2, NA, 4), y = c(NA, NA, 7, 8))
na.omit(dfrm)
## x y
## 4 4 8
< less than
<= less than or equal
> greater than
>= greater than or equal
== exactly equal to
!= not equal to
!x not x (logical negation)
x | y x or y
x & y x and y
x <- c(12, 15, 8, 11, 24)
i <- c(F, F, T, F, F)
x[i]
## [1] 8
which(x < 10)
## [1] 3
x[x < 10] <- 10
x
## [1] 12 15 10 11 24
Useful facts: since TRUE counts as 1 and FALSE as 0, sum counts the TRUEs and mean gives the proportion of TRUEs.
i <- c(F, F, T, F, F)
sum(i)
## [1] 1
mean(i)
## [1] 0.2
x <- c(12, 15, 8, 11, 24)
mean(x > 11)
## [1] 0.6
Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command
mdata <- subset(cdc, cdc$gender == "m")
will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual
head(mdata)
This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.
You can use several of these conditions together with & and |. The & is read “and” so that
m_and_over30 <- subset(cdc, gender == "m" & age > 30)
will give you the data for men over the age of 30. The | character is read “or” so that
m_or_over30 <- subset(cdc, gender == "m" | age > 30)
will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.
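For instance, here is a sketch on a tiny hypothetical data frame (standing in for cdc) that combines an "and" clause with an "or" clause:

```r
# Hypothetical mini data frame, not the real cdc data
d <- data.frame(gender  = c("m", "f", "m", "f"),
                age     = c(45, 22, 28, 60),
                exerany = c(1, 1, 0, 0))
# men over 30, OR anyone who exercised in the past month
subset(d, (gender == "m" & age > 30) | exerany == 1)  # keeps rows 1 and 2
```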
Exercise: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 who have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

# new data = cdc obs where age < 23 AND smoke100 == 1
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
Hadley Wickham (assistant professor of statistics at Rice, major R guru, and all-around good guy) has a knack for developing packages that make life a lot easier for R users. Two of his best packages are plyr and dplyr; dplyr in particular is an excellent way to work with data.
Load (and if necessary install) plyr and dplyr. Note: you should always load plyr before dplyr, as the two packages have some overlapping functions that respond best when plyr is loaded first.
#install.packages(c('plyr','dplyr'))
library(plyr)
library(dplyr)
In Hadley’s own words, the dplyr package makes data manipulation fast and easy by:
1. Elucidating the most common data manipulation operations, so that your options are helpfully constrained when thinking about how to tackle a problem.
2. Providing simple functions that correspond to the most common data manipulation verbs, so that you can easily translate your thoughts into code.
3. Using efficient data storage backends, so that you spend as little time waiting for the computer as possible.
dplyr provides a few core functions for data manipulation. Most data manipulations can be done by combining these verbs together, which becomes even easier with the %>% operator.
- filter(): subset observations by logical conditions
- slice(): subset observations by row numbers
- arrange(): sort the data by variables
- select(): select a subset of variables
- rename(): rename variables
- distinct(): keep only distinct rows
- mutate() and transmute(): add new variables
- group_by(): group the data according to variables
- summarise(): summarize multiple values into a single value
- sample_n() and sample_frac(): select a random sample of rows

At the outset, let's focus on single-table data. These are data that are in a single data frame or that you might find in a csv file or Excel sheet. Install and load the Lahman package, which contains baseball statistical data from Sean Lahman, and load the Batting data set included in the package.
#install.packages('Lahman',repos = "http://cran.us.r-project.org")
library(Lahman)
data(Batting)
Use our old (or somewhat new) friend the str function:
str(Batting)
## 'data.frame': 97889 obs. of 24 variables:
## $ playerID : chr "aardsda01" "aardsda01" "aardsda01" "aardsda01" ...
## $ yearID : int 2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 117 35 33 16 116 116 93 80 80 80 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 5 5 2 2 2 2 2 5 5 5 ...
## $ G : int 11 45 25 47 73 53 1 122 153 153 ...
## $ G_batting: int 11 43 2 5 3 4 NA 122 153 153 ...
## $ AB : int 0 2 0 1 0 0 NA 468 602 609 ...
## $ R : int 0 0 0 0 0 0 NA 58 105 106 ...
## $ H : int 0 0 0 0 0 0 NA 131 189 200 ...
## $ X2B : int 0 0 0 0 0 0 NA 27 37 34 ...
## $ X3B : int 0 0 0 0 0 0 NA 6 9 14 ...
## $ HR : int 0 0 0 0 0 0 NA 13 27 26 ...
## $ RBI : int 0 0 0 0 0 0 NA 69 106 92 ...
## $ SB : int 0 0 0 0 0 0 NA 2 3 2 ...
## $ CS : int 0 0 0 0 0 0 NA 2 1 4 ...
## $ BB : int 0 0 0 0 0 0 NA 28 49 37 ...
## $ SO : int 0 0 0 1 0 0 NA 39 61 54 ...
## $ IBB : int 0 0 0 0 0 0 NA NA 5 6 ...
## $ HBP : int 0 0 0 0 0 0 NA 3 3 2 ...
## $ SH : int 0 1 0 0 0 0 NA 6 7 5 ...
## $ SF : int 0 0 0 0 0 0 NA 4 4 7 ...
## $ GIDP : int 0 0 0 0 0 0 NA 13 20 21 ...
## $ G_old : int 11 45 2 5 NA NA NA 122 153 153 ...
At a basic level, dplyr provides you with five tools that you can use to work with a single data table: filter, arrange, mutate, select, and summarise.
The filter command subsets a data table based upon the observed values. The following code keeps only Seattle Mariners players:
filter(Batting, teamID=='SEA')
To save the new filtered table, you need to record it as an object by assigning it a name:
mariners <- filter(Batting, teamID=='SEA')
You can also filter based upon multiple attributes or for multiple values:
#only Seattle Mariners players in 2010
filter(Batting, teamID=='SEA' & yearID==2010)
#either Mariners players or San Diego Padres players
filter(Batting, teamID=='SEA'|teamID=='SDN')
The filter() function will retain all rows for which the logical query you specify is TRUE. Thus, you can also use != to filter based upon values that do not equal the specified value.
#keep all players EXCEPT Seattle Mariners players
filter(Batting, teamID!='SEA')
#keep all ATL players for year 2000 and after
#note that unlike subset command, filter does not require you to use the & sign to build a full logical statement
atl.post2000 <- filter(Batting, teamID=='ATL', yearID>=2000)
The arrange command provides an easy way to sort observations. By default, the smallest value will go at the top; you can invert this using the “-”.
#sort by number of home runs, lowest to highest
arrange(Batting, HR)
#sort by number of home runs, highest to lowest
arrange(Batting, -HR)
You can also sort by multiple attributes at the same time, including character values:
#sort by number of home runs, highest to lowest, and then by team.
arrange(Batting, -HR, teamID)
arrange(Batting, -X2B)[1:3,]
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B HR
## 1 webbea01 1931 1 BOS AL 151 151 589 96 196 67 3 14
## 2 burnsge02 1926 1 CLE AL 151 151 603 97 216 64 3 4
## 3 medwijo01 1936 1 SLN NL 155 155 636 115 223 64 13 18
## RBI SB CS BB SO IBB HBP SH SF GIDP G_old
## 1 103 2 2 70 51 NA 0 1 NA NA 151
## 2 114 13 7 28 33 NA 8 18 NA NA 151
## 3 138 3 NA 34 33 NA 4 3 NA 14 155
It looks like some old-time players from 1931 (Earl Webb), 1926 (George Burns), and 1936 (Joe Medwick) are the top three single-season doubles hitters of all time. Note that I used the bracket index [1:3,] to print only the top three, so I didn't have to scroll through all of the observations to get back to the top.
The mutate command is used to generate new variables in the data table or to edit existing variables. For instance, we can create a new variable (e.g., stolen bases ("SB") plus home runs ("HR")):
mutate(Batting, SBHR = SB + HR)
or modify an existing variable:
mutate(Batting, RBI = RBI + 1000)
One important thing to remember is that these changes will not be stored as part of the original object. Thus, you have to assign the mutated data table to an object name (either the same name or a new name):
Batting = mutate(Batting, SBHR = SB + HR)
My new variable is at-bats per game, or the average number of times a player came up to bat in a game that season:
Batting = mutate(Batting, ABpergame = AB/G)
The select command keeps only the variables (columns) that you name:
#select player id, team id, at bats, and homeruns
select(Batting, playerID,teamID,AB,HR)
Sometimes, you want to drop one or two columns and keep the rest. It can be incredibly cumbersome to type out all the names that you want to keep. Instead, you can invert the select() function, again using the "-" sign, to drop a specified variable:
#drop homeruns from datatable
select(Batting, -HR)
#drop G and G_batting
batting.subset = select(Batting, -G, -G_batting)
names(batting.subset) #see, G and G_batting are gone!
## [1] "playerID" "yearID" "stint" "teamID" "lgID"
## [6] "AB" "R" "H" "X2B" "X3B"
## [11] "HR" "RBI" "SB" "CS" "BB"
## [16] "SO" "IBB" "HBP" "SH" "SF"
## [21] "GIDP" "G_old" "ABpergame"
Finally, the summarise command can be used to generate summary statistics. For instance, you can compute the mean or median of a given variable. Note that the summarise command is slightly different from some other operations, in that the variable you want to summarise must be wrapped in the summary function you want to apply.
summarise(Batting,mean(HR,na.rm=T))
Note also that you need to set na.rm=T within the mean function; otherwise the function will choke on the presence of NA values. Setting na.rm=T tells R to ignore NA values when computing the variable mean. You can request multiple summaries:
summarise(Batting,mean(HR,na.rm=T),mean(SB,na.rm=T))
summarise(Batting,mean(HR,na.rm=T),sd(HR,na.rm=T))
or you can use the summarise_each() function to do the same thing. Notice that the "." stands in for each variable, and that you set na.rm=T within each function.
summarise_each(Batting,funs(min(.,na.rm=T),max(.,na.rm=T)),HR,SB)
You might be wondering what the point is of the summarise and summarise_each functions, since so far we have used them to do operations that we can already do quite easily with base functions such as min(), max(), and mean() (e.g., mean(Batting$HR,na.rm=T)).
The group_by function is a key addition that greatly multiplies the power of dplyr, as it allows us to compute grouped summary values, for instance the average number of home runs hit in a season by players on each team:
summarise(group_by(Batting,teamID),mean(HR,na.rm=T))
## Source: local data frame [149 x 2]
##
## teamID mean(HR, na.rm = T)
## 1 ALT 0.1176471
## 2 ANA 4.6784452
## 3 ARI 3.7345254
## 4 ATL 3.7478216
## 5 BAL 4.7729167
## 6 BFN 0.8606557
## 7 BFP 0.7692308
## 8 BL1 0.5000000
## 9 BL2 0.6903553
## 10 BL3 0.8888889
## .. ... ...
summarise(group_by(Batting,teamID),max(SB,na.rm=T),max(X3B,na.rm=T))
## Source: local data frame [149 x 3]
##
## teamID max(SB, na.rm = T) max(X3B, na.rm = T)
## 1 ALT 0 2
## 2 ANA 34 17
## 3 ARI 72 14
## 4 ATL 72 17
## 5 BAL 57 12
## 6 BFN NA 17
## 7 BFP 39 12
## 8 BL1 8 9
## 9 BL2 94 19
## 10 BL3 75 18
## .. ... ... ...
We can assign a name to the new summary variable so the output looks nicer:
summarise(group_by(Batting,teamID),maxSB = max(SB,na.rm=T),max3b = max(X3B,na.rm=T))
## Source: local data frame [149 x 3]
##
## teamID maxSB max3b
## 1 ALT 0 2
## 2 ANA 34 17
## 3 ARI 72 14
## 4 ATL 72 17
## 5 BAL 57 12
## 6 BFN NA 17
## 7 BFP 39 12
## 8 BL1 8 9
## 9 BL2 94 19
## 10 BL3 75 18
## .. ... ... ...
Finally, perhaps the coolest feature of dplyr is that you can daisy-chain functions together. dplyr imports a special operator, %>%, from the magrittr package to do this. Typically, when working with data you will perform a series of operations; dplyr allows you to link these operations together without needing to generate a series of intermediate objects.
Basically, you do this by starting with the data table (Batting in this case) and then using the “%>%” operator to link functions. Within a chain, you do not need to place the data table name within each operation function:
Batting %>% filter(AB>400,yearID>1990) %>% group_by(teamID) %>% summarise(max(SB))
Exercise: Use chaining to find the minimum number of hits (H) made in a single season by a player who had at least 400 at-bats (AB) in a single season since the year 2000.

Batting %>% filter(AB>=400,yearID>=2000) %>% summarise(min(H))
## min(H)
## 1 66
Exercise: Use filter, select, and slice to show only the year and home runs of Seattle Mariners players for the first two observations (i.e., just the first two rows).

# data %>% drop obs where teamID != SEA %>% choose yearID, HR %>% print first 2 rows
Batting %>% filter(teamID=='SEA') %>% select(yearID,HR) %>% slice(1:2)
## yearID HR
## 1 2009 0
## 2 2010 0
We will be using the graphics package ggplot2, which is one of the most popular, but it is only one of several graphics packages in R.
Unlike many other graphics systems, functions in ggplot2 do not correspond to separate types of graphs. There are not scatterplot, histogram, or line chart functions per se. Instead plots are built up from component functions.
Install and load the gapminder package, which provides an excerpt of data from Gapminder.org concerning worldwide development statistics. Don’t forget to load ggplot2 as well!
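The original has no setup chunk here, so as a sketch (assuming neither package is installed yet), the setup might look like:

```r
# Setup sketch: install once, then load the packages used below
# install.packages(c("gapminder", "ggplot2"))
library(gapminder)
library(ggplot2)
```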
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
This gives an error message because there is nothing to plot yet!
This just initializes the plot object; it is better if you assign it to an object, and p is good enough:
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
Now we can add geoms
p + geom_point()
That looks okay, but it would probably look better if we log-transform:
p_l <- ggplot(gapminder, aes(x = log10(gdpPercap), y = lifeExp))
p_l + geom_point()
A better way to log transform
p + geom_point() + scale_x_log10()
Let’s make that stick
p <- p + scale_x_log10()
Common workflow: gradually build up the plot you want, re-defining the object p as you develop "keeper" commands. Note that in the reassignment we excluded the geom. Now, map the continent variable to the color aesthetic:
p + geom_point(aes(color = continent))
In full detail, up to now:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10()
Let's address over-plotting: set the alpha transparency and size to fixed values
p + geom_point(alpha = (1 / 3), size = 3)
Now add a fitted curve or line
p + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
p + geom_point() + geom_smooth(lwd = 2, se = FALSE)
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
p + geom_smooth(lwd = 1, se = FALSE, method = "lm") + geom_point()
That’s great but I actually want to revive our interest in continents!
p + aes(color = continent) + geom_point() + geom_smooth(lwd = 3, se = FALSE)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Facetting: another way to exploit a factor
p + geom_point(alpha = (1 / 3), size = 3) + facet_wrap(~ continent)
Still want lines? Let’s add them
p + geom_point(alpha = (1 / 3), size = 3) + facet_wrap(~ continent) +
geom_smooth(lwd = 2, se = FALSE)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Challenge
# plot lifeExp against year
y <- ggplot(gapminder, aes(x = year, y = lifeExp)) + geom_point()
plot(y)
#make mini-plots, split out by continent
y + facet_wrap(~ continent)
# add a fitted smooth and/or linear regression, w/ or w/o facetting
y + geom_smooth(se = FALSE, lwd = 2) +
geom_smooth(se = FALSE, method ="lm", color = "orange", lwd = 2)
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
y + geom_smooth(se = FALSE, lwd = 2) +
facet_wrap(~ continent)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
What if I am only interested in the US?
ggplot(filter(gapminder, country == "United States"),
aes(x = year, y = lifeExp)) +
geom_line() +
geom_point()
Let's just look at five countries
some_countries <- c("United States", "Canada", "Rwanda", "Cambodia", "Mexico")
ggplot(filter(gapminder, country %in% some_countries),
aes(x = year, y = lifeExp, color = country)) +
geom_line() +
geom_point()
So what's up with Mexico? Does GDP per capita explain it? Mapping point size to gdpPercap suggests not really…
ggplot(subset(gapminder, country %in% some_countries),
aes(x = year, y = lifeExp, color = country)) +
geom_line() +
geom_point(aes(size=gdpPercap))
You can change the way the plot looks overall using theme
ggplot(subset(gapminder, country %in% some_countries),
aes(x = year, y = lifeExp, color = country)) +
geom_line() +
geom_point(aes(size=gdpPercap)) +
theme_minimal()
In addition to the themes included with ggplot, several other themes are available in the ggthemes package.
Live demo (http://spia.uga.edu/faculty_pages/tyler.scott/teaching/PADP8120_Fall2015/Homeworks/submitting_homework.shtml)
Questions?
Goal check
Motivation for next class